Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
Authors: Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally
Abstract
Large-scale distributed training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find that 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during this compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and the Librispeech Corpus. In these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270× to 600× without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and that of DeepSpeech from 488MB to 0.74MB. Deep Gradient Compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile devices.
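The 270× to 600× ratios quoted above come mostly from transmitting only the largest roughly 0.1% of gradient entries each iteration and accumulating the rest locally until they grow large enough to matter. The NumPy sketch below illustrates that worker-side step together with the momentum correction and momentum factor masking named in the abstract; the function name dgc_step, the 0.1% sparsity setting, and the buffer layout are illustrative assumptions rather than the authors' reference implementation, and warm-up training and local gradient clipping are omitted for brevity.

import numpy as np

def dgc_step(grad, velocity, residual, momentum=0.9, sparsity=0.001):
    # One worker-side, DGC-style compression step (illustrative sketch).
    # velocity: locally accumulated momentum (momentum correction)
    # residual: gradient mass that has not been transmitted yet
    # Returns the sparse (indices, values) to send plus the updated buffers.

    # Momentum correction: apply momentum locally before accumulation.
    velocity = momentum * velocity + grad
    residual = residual + velocity

    # Send only the largest ~0.1% of entries by magnitude.
    k = max(1, int(sparsity * residual.size))
    idx = np.argpartition(np.abs(residual), -k)[-k:]
    values = residual[idx].copy()

    # Momentum factor masking: clear the transmitted coordinates in both
    # buffers so stale momentum does not keep pushing them next round.
    residual[idx] = 0.0
    velocity[idx] = 0.0
    return idx, values, velocity, residual

# Example: for a one-million-parameter gradient, only about 1,000
# (index, value) pairs leave the node per iteration.
g = np.random.randn(1_000_000).astype(np.float32)
v = np.zeros_like(g)
r = np.zeros_like(g)
idx, vals, v, r = dgc_step(g, v, r)

Each worker would then send (idx, vals) instead of the dense gradient, and the aggregation step scatters the received values back into a full-size buffer before updating the model.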
Similar resources
Distributed Training
Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections....
Variance-based Gradient Compression for Efficient Distributed Deep Learning
Due to the substantial computational cost, training state-of-the-art deep neural networks for large-scale datasets often requires distributed training using multiple computation workers. However, by nature, workers need to frequently communicate gradients, causing severe bottlenecks, especially on lower bandwidth connections. A few methods have been proposed to compress gradient for efficient c...
Homomorphic Parameter Compression for Distributed Deep Learning Training
Distributed training of deep neural networks has received significant research interest, and its major approaches include implementations on multiple GPUs and clusters. Parallelization can dramatically improve the efficiency of training deep and complicated models with large-scale data. A fundamental barrier against the speedup of DNN training, however, is the trade-off between computation and ...
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradient...
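For contrast with DGC's sparsification, the quantization route that QSGD takes can be sketched in a few lines: each coordinate is stochastically rounded to one of a small number of levels so that the quantized gradient remains an unbiased estimate of the original. The toy NumPy version below assumes uniform levels and omits QSGD's lossless encoding of the result; qsgd_quantize and its levels parameter are illustrative names, not the paper's API.

import numpy as np

def qsgd_quantize(grad, levels=4):
    # Stochastically quantize a gradient to `levels` uniform magnitude levels.
    # Rounding up or down at random keeps the expectation equal to grad.
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad)
    scaled = np.abs(grad) / norm * levels      # each entry now lies in [0, levels]
    lower = np.floor(scaled)
    prob_up = scaled - lower                   # probability of rounding up
    quantized = lower + (np.random.rand(*grad.shape) < prob_up)
    # Reconstructed (dequantized) gradient; in practice only the sign, the
    # level index, and the norm would be transmitted.
    return np.sign(grad) * quantized / levels * norm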
Decentralization Meets Quantization
Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: quantization for low bandwidth networks, and decentralization for high latency networks. In this paper, we explore a natural question: can the combination of both decentralization and quantization lead to a system that...
Journal: CoRR
Volume: abs/1712.01887
Issue: -
Pages: -
Publication date: 2017